
ACOUSTIC ECHO CANCELLATION SYSTEM 

FIELD OF INVENTION 

This invention relates to a multi-channel acoustic echo cancellation system and
more particularly to a cancellation system using time-varying all-pass filtering for signal
decorrelation.

BACKGROUND OF INVENTION 

At present, most teleconferencing systems use a single full-duplex audio channel 
for voice communications. These systems also make use of an acoustic echo canceller to 
reduce the undesired echo resulting from the coupling between the loudspeaker and the 
microphone. To make these systems more lifelike, better and more realistic sound 
systems are required. High-fidelity, wide-bandwidth (100 to 7000 Hz) voice
communication systems are now being used. However, in order to introduce spatial
realism, more than one channel is needed. Therefore, future teleconferencing systems are 
expected to have more than one channel (at least stereo with two channels) of full duplex 
voice communications. 

One of the fundamental problems in stereophonic acoustic echo cancellation 
(AEC) systems is that given the input to the loudspeakers and the output of the 
microphones in the receiving room, the echo path cannot be determined uniquely. See for 
example the following references: J. Benesty, D. R. Morgan and M. M. Sondhi, "A
Better Understanding and an Improved Solution to the Problems of Stereophonic
Acoustic Echo Cancellation," Preprint, Proceedings of ICASSP-97, Vol. I, pp. 303-306,
Munich, Germany, April 21-24, 1997; J. Benesty, P. Duhamel and Y. Grenier,
"Multi-Channel Adaptive Filtering Applied to Multi-Channel Acoustic Echo Cancellation,"
Preprint, Submitted to IEEE Trans. on Signal Processing, April 1995; S. Shimauchi and
S. Makino, "Stereo Projection Echo Canceller with True Echo Path Estimation,"
Proceedings of ICASSP-95, pp. 3059-3062, 1995; and M. M. Sondhi, D. R. Morgan and
J. L. Hall, "Stereophonic Acoustic Echo Cancellation - An Overview of the Fundamental
Problem," IEEE Signal Processing Letters, Vol. 2, No. 8, pp. 148-151, August 1995. The
problem is due to the correlation between the stereo signals. As a result, any adaptive
technique used in stereophonic AEC systems fails to identify the echo path responses
correctly. To circumvent this problem, it is necessary to develop techniques to
decorrelate the stereo signals at the input to the loudspeakers without affecting stereo
perception.

Several techniques have been proposed in the past, e.g., addition of random noise,
modulation of the signal, decorrelation filters, inter-channel frequency shifting, etc.
However, these techniques either do not decorrelate the signals or destroy stereo perception
completely. The interleaving comb filtering proposed in Sondhi et al., cited above, only
gives partial identification (above 1 kHz) of the echo path responses. Recently, a
technique based on non-linear processing of the stereo signals was proposed in Benesty
et al., cited above. However, as noted by the authors of Benesty et al., for tonal signals the
technique based on non-linearity cannot maintain transparency in perception (it changes the
pitch perception).

SUMMARY OF INVENTION

In accordance with one embodiment of the present invention, a multi-channel 
acoustic cancellation system includes time-varying all-pass filtering in signal paths to 
provide decorrelation of signals. 

IN THE DRAWINGS

Fig. 1 illustrates a prior art stereophonic echo cancellation system;
Fig. 2 illustrates a stereophonic echo cancellation system according to the present 
invention; 

Fig. 3 is a plot of time delay vs. frequency for all-pass filters with $\alpha_{i,\min} = -0.9$
and $\alpha_{i,\max} = 0$ in Fig. 2; and

Fig. 4 illustrates behavior of misalignment with original signal without the all- 
pass filtered input and according to the present invention with the all-pass filtered input. 

DESCRIPTION OF PREFERRED EMBODIMENT OF PRESENT INVENTION



Fig. 1 shows the configuration of a typical stereophonic echo cancellation system
10. The transmission room 11 (depicted on the left) has two microphones 13 and 15 that
pick up the speech signal, $x$, via the two acoustic paths characterized by the impulse
responses $g_1$ and $g_2$. All acoustic paths are assumed to include the microphone and/or
loudspeaker responses. The $i$-th microphone output is then given by (in the frequency
domain)

$$X_i(\omega) = G_i(\omega) X(\omega). \qquad (1)$$

In this application, the upper-case letters represent the Fourier transforms of the
time-domain signals denoted by the corresponding lower-case letters. The whole system 
is considered as a discrete-time system ignoring any A/D or D/A converter. These signals 
are presented through the set of loudspeakers 27 and 29 in the receiving room 21 (on the 
right in Fig. 1). Each microphone 23 and 25 in the receiving room picks up an echo ($y_1$,
$y_2$ in Fig. 1) from each of the loudspeakers. Let $h_{ij}$ be the acoustic path impulse response
from the $j$-th loudspeaker to the $i$-th microphone. In Fig. 1 the path from speaker 27 to
microphone 23 is $h_{11}$, the path from speaker 27 to microphone 25 is $h_{21}$, the path from
speaker 29 to microphone 25 is $h_{22}$ and the path from speaker 29 to microphone 23 is $h_{12}$. Then
the echoes ($y_1$, $y_2$) picked up by the microphones 23 and 25 in the receiving room 21 are
given by (in the frequency domain)

$$Y_i(\omega) = \sum_j H_{ij}(\omega) X_j(\omega). \qquad (2)$$

In the absence of any AEC, the echoes $y_i$ will be passed back to the loudspeakers
17, 19 in the transmission room 11 and will be recirculated again and again. This will
cause multiple echoes or may even result in howling instability. Commonly used AEC
systems use adaptive finite impulse response (FIR) filters that provide estimates of the
echo path responses. The FIR filter coefficients are updated adaptively depending on the
input signals to the loudspeakers and the outputs of the microphones.
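
As a rough illustration of such an adaptive FIR echo-path estimate, the following is a minimal single-channel sketch using the normalized LMS (NLMS) update. NLMS is chosen here only for brevity and is not the update used in the experiments described later; the function name, step size `mu`, regularizer `eps`, and the toy echo path are all hypothetical.

```python
import random

def nlms_echo_canceller(x, d, num_taps, mu=0.5, eps=1e-8):
    """Adapt an FIR estimate of the echo path with NLMS (illustrative only).

    x: loudspeaker input samples, d: microphone samples containing the echo.
    Returns (h_hat, e): final filter taps and the error (echo-reduced) signal.
    """
    h_hat = [0.0] * num_taps
    buf = [0.0] * num_taps                      # recent inputs, newest first
    e = []
    for x_n, d_n in zip(x, d):
        buf = [x_n] + buf[:-1]
        y_hat = sum(h * b for h, b in zip(h_hat, buf))    # estimated echo
        err = d_n - y_hat                                  # error signal
        norm = sum(b * b for b in buf) + eps
        h_hat = [h + mu * err * b / norm for h, b in zip(h_hat, buf)]
        e.append(err)
    return h_hat, e

# Hypothetical toy echo path driven by white noise (assumed data, not from the text).
random.seed(0)
true_path = [0.5, -0.3, 0.2, 0.1]
x = [random.uniform(-1.0, 1.0) for _ in range(4000)]
d = [sum(true_path[k] * x[n - k] for k in range(len(true_path)) if n - k >= 0)
     for n in range(len(x))]
h_hat, e = nlms_echo_canceller(x, d, num_taps=4)
```

With a single channel and a noise-like input, the estimate converges to the true path. The stereophonic case discussed next is harder because the two loudspeaker inputs are correlated.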

In the stereophonic AEC, there are four echo paths ($h_{11}$, $h_{12}$, $h_{21}$ and $h_{22}$) to be
identified. We, therefore, need four adaptive filters 31-34 as shown in Fig. 1. Filter 31 is
the estimate for the $h_{11}$ path, filter 32 is the estimate for the $h_{12}$ path, filter 33 is the
estimate for the $h_{21}$ path and filter 34 is the estimate for the $h_{22}$ path. The estimates
$\hat{h}_{11}$ and $\hat{h}_{12}$ of paths to echo $y_1$ are summed at 37 and the estimates $\hat{h}_{21}$ and $\hat{h}_{22}$ of paths
to echo $y_2$ are summed at 38. The outputs of the AEC filters (which can be thought of as
estimated echoes) are as follows

$$\hat{Y}_i(\omega) = \sum_j \hat{H}_{ij}(\omega) X_j(\omega).$$

These estimated echoes at 37 and 38 are subtracted at adders 35 and 36 from the
true echoes $y_1$ and $y_2$, giving the error signals ($e_1$ and $e_2$ in Fig. 1),

$$E_i(\omega) = Y_i(\omega) - \hat{Y}_i(\omega).$$

These error signals are used to update the coefficients of filters 31-34 (represented by
feedback lines 41, 42). Several techniques are available to calculate the filter updates
(e.g., the least mean squares (LMS), the recursive least squares (RLS), the affine
projection (AP) algorithms, etc.). All these techniques attempt to minimize these error
signals in one way or another.

The data available to the echo canceller are the inputs to the loudspeakers, the $x_i$'s, as
well as the outputs of the microphones, the $y_i$'s, in the receiving room 21. The fundamental
problem of stereophonic AEC systems is that given this set of data, it is not possible to
uniquely determine the echo paths to drive the errors, the $e_i$'s, to zero (i.e., to eliminate the
echoes). In order to explain this, let us look at the error in one of the channels (similar
analysis can be carried out for the other channels). In the frequency domain, this error is
given by

$$E_i(\omega) = \sum_j \left( H_{ij}(\omega) - \hat{H}_{ij}(\omega) \right) G_j(\omega) X(\omega).$$

Let us assume that somehow we have been able to achieve perfect echo
cancellation, i.e., we have $E_i(\omega) = 0$. Assuming that $X(\omega)$ does not have zeros in the
frequencies of interest, the above gives

$$\sum_j \left( H_{ij}(\omega) - \hat{H}_{ij}(\omega) \right) G_j(\omega) = 0. \qquad (3)$$

This equation does not imply $\hat{H}_{ij}(\omega) = H_{ij}(\omega)$. Therefore, even if the echo has
been driven to zero, we have not necessarily achieved perfect alignment. In other words,
the canceller has not necessarily identified the true echo path. In fact, the above equation
has infinitely many solutions for $\hat{H}_{ij}(\omega)$. Any adaptation algorithm may lead to any one
of these solutions. Note that so long as the conditions in both the transmitting and the
receiving rooms are fixed, this does not cause any problem as the echo will remain zero.
However, the adaptation technique has to track not only the changes in the receiving
room that change the echo path responses, $h_{ij}$, but also the changes in the conditions in the
transmitting room as reflected through changes in $g_i$. Tracking the conditions in the
transmitting room can be especially problematic as $g_i$ may change abruptly and by a large
amount (e.g., one speaker stops talking and another speaker starts speaking from a
different location).
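
The non-uniqueness can be seen in a small time-domain sketch. Under the assumption that both loudspeaker signals come from one source through $g_1$ and $g_2$, the "wrong" filters $h_{11} + c\,g_2$ and $h_{12} - c\,g_1$ reproduce the echo exactly for any constant $c$. All numeric values and helper names below are hypothetical; integers are used so the comparison is exact.

```python
def convolve(a, b):
    """Plain linear convolution of two sequences."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def add(a, b):
    """Element-wise sum, zero-padding the shorter sequence."""
    n = max(len(a), len(b))
    a = list(a) + [0] * (n - len(a))
    b = list(b) + [0] * (n - len(b))
    return [p + q for p, q in zip(a, b)]

def scale(a, c):
    return [c * v for v in a]

# Hypothetical source signal and transmission-room paths g_1, g_2.
x = [3, -1, 4, 1, -5, 9, 2, -6]
g1 = [1, 2]
g2 = [2, -1]
x1 = convolve(x, g1)          # loudspeaker 1 signal
x2 = convolve(x, g2)          # loudspeaker 2 signal

# Hypothetical true echo paths to microphone 1.
h11 = [4, 0, -2]
h12 = [1, 3, 1]
echo_true = add(convolve(x1, h11), convolve(x2, h12))

# A different filter pair that satisfies equation (3): the difference terms cancel.
c = 5
h11_wrong = add(h11, scale(g2, c))
h12_wrong = add(h12, scale(g1, -c))
echo_wrong = add(convolve(x1, h11_wrong), convolve(x2, h12_wrong))

print(echo_true == echo_wrong, h11_wrong, h12_wrong)
```

The two filter pairs give identical echoes for this source position, yet the second pair would stop cancelling the echo as soon as $g_1$ or $g_2$ changed.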

A detailed discussion of this problem describing several viewpoints can be found
in the above-cited references. In particular, the discussion in Benesty et al. provides a better
understanding of the above problem both in terms of non-uniqueness and misalignment of
the solutions.

As discussed above, the reason for non-perfect alignment is that the two signals
are correlated. Correlation between the stereo signals does not allow sufficient identification of
the echo path responses. Thus, in order to solve the problem, we have to find a technique
to decorrelate the input signals to the loudspeakers, $x_i$, in such a way that it does not affect
the stereo perception in the receiving room.

The system 40 for stereophonic echo cancellation is shown in Fig. 2.
Each of the stereo signals is passed through a different all-pass filter 45, 47 denoted by
$\alpha_i(n)$. The index $n$ is used to indicate that the all-pass filter is time-varying (varying
with $n$).

Rigorously speaking, there is no frequency domain representation of the time-varying
filtering operation used in Fig. 2. However, if we assume that $\alpha_i(n)$ does not
change much for a given window around time instant $n$, then it is possible to assign a
frequency domain transfer function $A_i(\omega, n)$ to the filtering operation at time instant $n$.
Then the frequency spectra of the output at time instant $n$ can be formally written as

$$Y_i(\omega, n) = \sum_j H_{ij}(\omega) A_j(\omega, n) X_j(\omega).$$

Then the error in the $i$-th path is

$$E_i(\omega, n) = \sum_j \left( H_{ij}(\omega) - \hat{H}_{ij}(\omega) \right) A_j(\omega, n) G_j(\omega) X(\omega).$$

Now, if we can achieve perfect echo cancellation by setting $E_i(\omega, n) = 0$, then the
above implies

$$\sum_j \left( H_{ij}(\omega) - \hat{H}_{ij}(\omega) \right) A_j(\omega, n) G_j(\omega) X(\omega) = 0.$$

Since the above must be true for all $n$, i.e., for all variations of $A_j(\omega, n)$ with $n$,
we must have $\hat{H}_{ij}(\omega) = H_{ij}(\omega)$. Thus, by using the time-varying all-pass filter in the
signal path, it is possible to achieve perfect alignment between the adaptive filter and the
true echo path. In practice, perfect alignment is not possible due to the finite impulse
response of the modeling filters (the adaptive filters) as well as due to the noise present in
the signal. However, simulations show that this technique achieves much better
identification of the echo paths than was otherwise possible.
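
The uniqueness argument can be checked numerically at a single frequency. The sketch below stacks the zero-error condition at two time instants into a 2x2 linear system in the two misalignment terms $H_{ij} - \hat{H}_{ij}$ (the non-zero factor $X(\omega)$ is dropped); a non-zero determinant means only the zero solution exists. The all-pass form, the values of $G_j$, and the parameter values are assumptions made for illustration.

```python
import cmath

def allpass_response(alpha, omega):
    """Assumed first-order all-pass: (e^{-jw} - alpha) / (1 - alpha * e^{-jw})."""
    z_inv = cmath.exp(-1j * omega)
    return (z_inv - alpha) / (1 - alpha * z_inv)

def constraint_det(a_n1, a_n2, g):
    """Determinant of the system formed by the zero-error condition at two instants.

    a_n1, a_n2: pairs (A_1, A_2) of all-pass responses at two time instants.
    g: pair (G_1, G_2) of transmission-room responses at the same frequency.
    """
    row1 = (a_n1[0] * g[0], a_n1[1] * g[1])
    row2 = (a_n2[0] * g[0], a_n2[1] * g[1])
    return row1[0] * row2[1] - row1[1] * row2[0]

omega = 0.3 * cmath.pi                      # one frequency of interest
g = (0.8 + 0.2j, 0.5 - 0.4j)                # hypothetical G_1, G_2

# No all-pass filtering: both instants give the same condition, so the system is singular.
det_fixed = constraint_det((1, 1), (1, 1), g)

# Time-varying all-pass filters: hypothetical parameter values at two instants.
a_n1 = (allpass_response(-0.2, omega), allpass_response(-0.7, omega))
a_n2 = (allpass_response(-0.5, omega), allpass_response(-0.1, omega))
det_varying = constraint_det(a_n1, a_n2, g)

print(abs(det_fixed), abs(det_varying))
```

A singular system corresponds to the infinitely many solutions of equation (3); once the two all-pass responses vary differently over time, the determinant is non-zero and the misalignment terms are forced to zero.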

The system 40 must follow certain constraints. First, the signals that are modified
through the all-pass filters 45, 47 are played back through the loudspeakers in the
receiving room 21. Therefore, the time-variation of the all-pass filters has to be chosen in
such a way that it does not alter the stereo perception of the speech. Second, since an
adaptive filter will be used to identify the echo path responses, the time-variation of the
all-pass filters should be fast enough so that the adaptive technique used cannot track the
changes in the all-pass filters. On the other hand, it is desirable that the adaptive
technique be able to track changes in the receiving room 21. These conflicting
requirements show the importance of proper choice of the time-varying all-pass filters. In
the following, we discuss one possible choice.

The simplest all-pass filter is a first-order filter that can be described by a single
parameter $\alpha_i(n)$. The frequency response of such a system for a given $n$ can be written
as

Such a filter has several important features, namely

• $|A_i(\omega, n)| = 1.0$ $\forall \omega$ and $\forall n$, i.e., this filter passes all frequencies all the time
unattenuated.

• It only changes the phase of each frequency.

• It is completely determined by a single time-varying parameter $\alpha_i(n)$. Thus,
the design of the system involves proper choice of $\alpha_i(n)$.
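
The unit-magnitude property can be checked with a short sketch, assuming the common first-order all-pass form $A(\omega) = (e^{-j\omega} - \alpha)/(1 - \alpha e^{-j\omega})$ with real $\alpha$. The exact expression and sign convention intended in the text may differ, so this form is an assumption, as is the function name.

```python
import cmath
import math

def allpass_response(alpha, omega):
    """Frequency response of an assumed first-order all-pass section.

    Assumed form: A(w) = (e^{-jw} - alpha) / (1 - alpha * e^{-jw}),
    with alpha real and |alpha| < 1.
    """
    z_inv = cmath.exp(-1j * omega)
    return (z_inv - alpha) / (1 - alpha * z_inv)

omegas = [k * math.pi / 8 for k in range(9)]          # 0 .. pi
for alpha in (-0.9, -0.5, 0.0):
    response = [allpass_response(alpha, w) for w in omegas]
    max_gain_error = max(abs(abs(a) - 1.0) for a in response)
    phases = [round(cmath.phase(a), 3) for a in response]
    print(f"alpha={alpha:+.1f}  max |gain - 1| = {max_gain_error:.1e}  phase(rad) = {phases}")
```

The gain stays at one for every frequency and every parameter value, while the phase curve changes with $\alpha$; that phase change is what the time variation exploits.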

In order for the all-pass filter to be stable, the absolute value of $\alpha_i(n)$ must
be less than unity. Since all our signals are real, we have also restricted $\alpha_i(n)$ to be a real
value. This also simplifies the filtering operation. $\alpha_i(n)$ is a time-varying parameter.
Thus, we need to update $\alpha_i(n)$ at every time instant. The update rule for $\alpha_i(n)$ is as
follows

$$\alpha_i(n+1) = \alpha_i(n) + r_i(n),$$

set $\alpha_i(n+1) = \alpha_{i,\max}$ if $\alpha_i(n+1) > \alpha_{i,\max}$,

set $\alpha_i(n+1) = \alpha_{i,\min}$ if $\alpha_i(n+1) < \alpha_{i,\min}$. (4)

Here, $r_i(n)$ is an independent and identically distributed (iid) random variable
having a uniform probability distribution function (pdf) over the interval $[-\beta_i, \beta_i]$. $\beta_i$
indicates the maximum allowable deviation of $\alpha_i(n)$ from one instant to another. This
deviation corresponds to phase jitter introduced by the time-varying all-pass filter for the
$i$-th channel. $\beta_i$ should be made as large as possible to introduce enough signal
decorrelation. However, too large a value of $\beta_i$ will result in a noticeable change in speech
perception.
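
The update rule of equation (4) can be sketched as a clamped random walk. The value of `beta`, the starting value, and the seed below are hypothetical, and the time-domain difference equation used to apply the parameter sequence is an assumed first-order all-pass realization, not one given in the text.

```python
import random

def update_alpha(alpha, beta, alpha_min=-0.9, alpha_max=0.0, rng=random):
    """One step of equation (4): add uniform jitter, then clamp to the allowed range."""
    candidate = alpha + rng.uniform(-beta, beta)       # r_i(n) ~ U[-beta, beta]
    return min(max(candidate, alpha_min), alpha_max)

def time_varying_allpass(x, alphas):
    """Filter x with a first-order all-pass whose parameter changes every sample.

    Assumed difference equation: y(n) = -a(n) x(n) + x(n-1) + a(n) y(n-1).
    """
    y = []
    x_prev = 0.0
    y_prev = 0.0
    for x_n, a in zip(x, alphas):
        y_n = -a * x_n + x_prev + a * y_prev
        y.append(y_n)
        x_prev, y_prev = x_n, y_n
    return y

rng = random.Random(1)
beta = 0.05                                   # hypothetical maximum step
alphas = [-0.45]                              # hypothetical starting value
for _ in range(999):
    alphas.append(update_alpha(alphas[-1], beta, rng=rng))

signal = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
filtered = time_varying_allpass(signal, alphas)
print(min(alphas), max(alphas))
```

In a stereo system each channel would run its own independent parameter sequence, so the two loudspeaker signals receive different phase jitter.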

$\alpha_{i,\max}$ and $\alpha_{i,\min}$ in equation (4) represent the maximum and minimum allowable
values of $\alpha_i(n)$. In order to ensure stability, we must have $\alpha_{i,\max} < 1$ and $\alpha_{i,\min} > -1$.
Further restrictions are also required to maintain transparency in speech perception.
These restrictions are derived from the data known as "just noticeable inter-aural delay"
in psychoacoustics. A discussion of this is found in E. Zwicker and H. Fastl,
Psychoacoustics: Facts and Models, Heidelberg, Germany: Springer-Verlag, 1990. This
data represents the minimum change in the inter-aural time delay between the two ears at
a given frequency that causes a noticeable change in the perception of the direction of
sound. The all-pass filter changes the phase of each frequency of the input speech. The
effect of this phase change is to change the time of arrival of the signal at each frequency.
So, if we limit the phase changes so that the change in the time of arrival for
each channel is within the just noticeable inter-aural delay, then spatial perception of
the stereo signal will not be affected. The just noticeable inter-aural delay varies between 30
µsec and 200 µsec. We have chosen to limit the change in the time of arrival of each
frequency to within 60 µsec. This leads to the following values

$$\alpha_{i,\max} = 0 \quad \text{and} \quad \alpha_{i,\min} = -0.9$$

Fig. 3 shows the time delay as a function of frequency for the two all-pass filters
with $\alpha_{i,\min} = -0.9$ and $\alpha_{i,\max} = 0$. Since the values of $\alpha_i(n)$ for the all-pass filters in the
two stereo paths are kept within these limits, the resulting inter-aural delays are also within
60 µsec. Our experiments have shown that this choice leads to good signal decorrelation
to allow correct identification of echo path responses and also keeps the stereo perception
of speech unchanged.

In order to evaluate the technique, we collected stereo speech samples in our
audio laboratory. The audio laboratory was used as the transmitting room. We had two
speakers talking alternately in the room while two microphones were used to collect the
data. The data were sampled at a 16 kHz sampling rate. In one set of data, the speakers
were asked to stand still while talking. This was done to ensure that the echo path
responses remain the same. In another set, they were free to move around the room as
they talked into the microphones. We then used our technique to decorrelate the collected
stereo signals. We performed informal listening tests by playing the original and the
modified stereo signals over both loudspeakers and headphones. All these tests show that
the stereo perception of the modified signal is indistinguishable from that of the original.

We simulated the receiving room loudspeaker outputs by convolving the stereo
signals using the echo path responses $h_{11}$ and $h_{12}$. These two echo path responses were
obtained using the image method of Allen et al. based on room measurements of one of
our conference rooms. For more details on Allen et al., see the following reference: J.
Allen and D. Berkley, "Image Method for Efficiently Simulating Small-Room
Acoustics," J. Acoust. Soc. Am., Vol. 65, No. 4, pp. 943-950, April 1979. The
microphone output in the receiving room was simulated by summing up the outputs of
these two convolutions. In the above convolutions, we restricted the lengths of the echo
path responses to be 4096 samples long. We then used the two adaptive filters $\hat{h}_{11}$
and $\hat{h}_{12}$, each of length L = 2048 samples, to identify these echo path responses. We used
the fast affine projection technique of order 8 for updating the filter coefficients. See
Shimauchi et al., a reference cited above. Fig. 4 shows the misalignment in dB with time.
The misalignment is defined as

$$10 \log_{10} \frac{\| h_{1:2048} - \hat{h} \|^2}{\| h_{1:2048} \|^2}$$

where the subscript 1:2048 is used to indicate that the first 2048 samples of the
corresponding echo path responses have been used here. This figure corresponds to the
set of data when the transmitting room echo path responses were kept fixed as already
described. The dotted line corresponds to the case of the original signal and the solid line to
the case of modified data using our technique of time-varying all-pass filtering.
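
A misalignment figure of this kind can be computed as in the sketch below. It assumes an energy-ratio form (squared Euclidean norms inside the 10·log10) and truncates the true response to the adaptive filter length; both choices, the function name, and the toy values are assumptions.

```python
import math

def misalignment_db(h_true, h_hat):
    """Misalignment in dB between a true echo path and its estimate.

    Assumed form: 10*log10(||h_true[:L] - h_hat||^2 / ||h_true[:L]||^2),
    where L = len(h_hat) (the true response is truncated to the filter length).
    """
    length = len(h_hat)
    h_ref = h_true[:length]
    err_energy = sum((a - b) ** 2 for a, b in zip(h_ref, h_hat))
    ref_energy = sum(a * a for a in h_ref)
    if err_energy == 0.0:
        return float("-inf")                 # perfect alignment
    return 10.0 * math.log10(err_energy / ref_energy)

# Hypothetical toy example: a 4-tap true path and a 2-tap estimate.
h_true = [1.0, 0.0, 0.0, 0.5]
h_hat = [0.9, 0.0]
print(round(misalignment_db(h_true, h_hat), 3))
```

Lower (more negative) values mean the adaptive filter is closer to the true echo path over the samples it can model.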

Since we have used 'real-world' collected data for the transmitted signals, the
situation was not as bad as when simulated data was used. We did not experience sudden
jumps, but with the original signal the misalignment settles down at around -14 dB, whereas
with our technique of signal decorrelation, the misalignment goes below -20 dB.